Posters - Schedules


Wednesday, November 9, 2022 between 8:30 AM - 9:30 AM
Thursday, November 10, 2022 between 8:30 AM - 9:30 AM
Friday, November 11, 2022 between 8:30 AM - 9:30 AM
Virtual: A surprising loss for Recurrent Neural Networks
COSI: dream
  • Michele Tinti, The Wellcome Centre for Anti-Infectives Research School of Life Sciences University of Dundee, United Kingdom


Presentation Overview:

I developed an end-to-end procedure to predict the expression of genes from random promoters using Jupyter Notebook and TensorFlow. My approach uses recurrent (GRU) and convolutional neural networks to regress the strength of the target promoters using information encoded in the forward and reverse DNA strands. The starting point of my model, two bidirectional GRUs, was recently used in a Kaggle machine learning competition to predict the stability of RNA vaccines. In this work, I expanded on this architecture and found that adding convolutions and fully connected layers can efficiently extract features from DNA sequences and predict gene expression. Surprisingly, training this model benefits from using a binary cross-entropy loss coupled with a sigmoid activation on the output layer.
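The binary cross-entropy trick described above can be sketched as follows: continuous expression values are rescaled into (0, 1) and treated as soft targets for a sigmoid output. This is a minimal NumPy illustration, not the author's code; the scaling bounds `lo` and `hi` are assumed hyperparameters (here set to the 0–17 expression-bin range used in the challenge data).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_regression_loss(logits, expression, lo=0.0, hi=17.0):
    """Binary cross-entropy used as a regression loss.

    Expression values are rescaled into [0, 1] and used as soft
    targets for the sigmoid output of the network.
    """
    target = (expression - lo) / (hi - lo)            # map expression into [0, 1]
    pred = np.clip(sigmoid(logits), 1e-7, 1 - 1e-7)   # avoid log(0)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))
```

Because the targets are soft (e.g. 0.5 rather than a hard 0/1 label), the loss is minimized when the sigmoid output matches the rescaled expression value, which is what makes it usable for regression.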

Virtual: DREAM Challenge 2022: Predicting gene expression using millions of random promoter sequences, by Team Wan&Barton_BBK
COSI: dream
  • Ibrahim Alsaggaf, Birkbeck, University of London, United Kingdom
  • Patrick Greaves, Birkbeck, University of London, United Kingdom
  • Carl Barton, Birkbeck, University of London, United Kingdom
  • Cen Wan, Birkbeck, University of London, United Kingdom


Presentation Overview:

In this competition, we proposed a modified Temporal Convolutional Network (TCN) and a new loss function to train our model for predicting the expression profiles of random promoter sequences. In general, a TCN is a stack of convolutional layers in which each hidden layer has the same length as the input layer, guaranteeing that the prediction for the target time point depends on the information from all previous time points. The well-known dilated convolution operation was also exploited to expand the receptive field when coping with long input sequences, and residual connections, weight normalization, and spatial dropout were used to construct the convolutional layers. In our TCN model, the nucleotides of the input promoter sequence are encoded as character-level embeddings, which are parameterized as the learnable first layer of the TCN. The final output is generated by a linear layer added on top of the last residual block, which predicts the expression profile from the learned feature representation of the entire target promoter sequence. We also proposed a new loss function for training our TCN model in this competition. It combines the Mean Squared Error (MSE) with Pearson and Spearman correlation values, where a weight applied to the MSE term reflects the relative importance of the two types of correlation.
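The composite loss might look like the following NumPy sketch (illustrative only, not the team's implementation: the `mse_weight` default is an assumption, and in actual training one would use a differentiable approximation of the rank operation so the Spearman term can be backpropagated):

```python
import numpy as np

def pearson(a, b):
    # Pearson correlation between two 1-D arrays
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

def ranks(x):
    # rank transform (0 .. n-1); Spearman is Pearson on ranks
    r = np.empty_like(x)
    r[np.argsort(x)] = np.arange(len(x), dtype=x.dtype)
    return r

def combined_loss(pred, target, mse_weight=1.0):
    # minimize MSE while maximizing Pearson and Spearman correlations
    mse = float(np.mean((pred - target) ** 2))
    r = pearson(pred, target)
    rho = pearson(ranks(pred), ranks(target))
    return mse_weight * mse - r - rho
```

When predictions match the targets exactly, both correlations equal 1 and the MSE is 0, so the loss reaches its minimum of -2.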

We used the preprocessed training dataset (6,728,720 sequences) to train our model. As a TCN can be trained on variable-length inputs, our model was trained on 105,123 batches created sequentially from 23 bins, where each batch holds sequences of the same length. We used the weekly leaderboard testing sequences to briefly estimate the predictive performance of our model. Because of the highly noisy distribution of the training dataset, we believe that using all training sequences leads to the best model generalizability with respect to the testing sequences. Note that, because of the potential bias of the leaderboard testing sequences (they were randomly sampled as ∼13% of the entire testing dataset), we did not adopt an early-stopping strategy based on leaderboard performance, although the leaderboard sequences were used for hyperparameter optimization. According to the first-stage leaderboard results, our model successfully outperformed two benchmark methods.
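The same-length batching scheme described above can be sketched as follows (a hypothetical illustration: the grouping into one bin per distinct length and the `batch_size` parameter are assumptions, not the team's exact binning):

```python
from collections import defaultdict

def make_length_batches(sequences, batch_size):
    """Group sequences into bins by length, then cut each bin into batches,
    so every batch contains sequences of one length and no padding is needed."""
    bins = defaultdict(list)
    for seq in sequences:
        bins[len(seq)].append(seq)
    batches = []
    for length in sorted(bins):          # iterate bins sequentially
        group = bins[length]
        for i in range(0, len(group), batch_size):
            batches.append(group[i:i + batch_size])
    return batches
```

Keeping each batch at a single length avoids padding entirely, which matters when a model such as a TCN accepts variable-length inputs.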

Virtual: Predicting Gene Expression Using a Residual CNN
COSI: dream
  • Fredrik Svensson, University College London, United Kingdom
  • Maria-Anna Trapotsi, University of Cambridge, United Kingdom
  • Susanne Bornelöv, University of Cambridge, United Kingdom


Presentation Overview:

Our study describes the "Camformers" team's submission to the DREAM challenge "Predicting gene expression using millions of random promoter sequences". The objective of the challenge was to predict reporter gene expression in S. cerevisiae from a 110 nucleotide (nt) DNA sequence representing a minimal promoter. The sequences consisted of an 80 nt variable region embedded between two fixed sequences of 17 and 13 nt, respectively. In total, 6,739,258 training examples were available and were used for modelling if they fulfilled quality criteria based on the expected length and the number of undetermined bases in the sequence. We used one-hot-encoded sequences as model input. To identify the best-performing model, different architectures were created in both PyTorch and TensorFlow and explored by varying model parameters and hyperparameters using Optuna. In our hands, the best and most stable performance was obtained in PyTorch using a convolutional neural network (CNN) with residual connections. The model included six convolutional layers with three residual connections, allowing the model to bypass every other layer. Batch normalisation and dropout were used for regularisation, and a max pooling operation was added after the penultimate convolutional layer to reduce model size and improve generalisation. The output of the final convolutional layer was flattened into 13,312 features and fed into a block of two dense layers outputting 256 features, followed by a final dense layer outputting the predicted expression level. All layers except the last used a rectified linear unit activation. The whole model had 16,611,073 trainable parameters, and the total runtime from raw data to predictions was ~16 hours using one core on a Google Cloud TPU v3-8. Model performance during training was r=0.748 (r²=0.560) and ρ=0.765 on our internal validation set (10%).

Final performance on the challenge leaderboard (13% of external test data) was r=0.962 (r²=0.926), ρ=0.967, ScorePearsonR²=0.763, and ScoreSpearman=0.823, outperforming previously published methods. In the final challenge, our team "Camformers" finished in 4th place. In addition to presenting our final submission, we will also describe our insights from the design and optimisation of alternative architectures, including CNN and transformer models.
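One-hot encoding of the 110 nt promoter sequences can be sketched as below (a minimal NumPy illustration; the A/C/G/T column ordering and the all-zero handling of undetermined bases such as 'N' are assumptions, not necessarily the team's exact choices):

```python
import numpy as np

BASES = "ACGT"
BASE_INDEX = {b: i for i, b in enumerate(BASES)}

def one_hot(seq):
    """Map a DNA string to a (len(seq), 4) one-hot matrix.

    Bases outside A/C/G/T (e.g. undetermined 'N') become all-zero rows.
    """
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for j, base in enumerate(seq.upper()):
        i = BASE_INDEX.get(base)
        if i is not None:
            out[j, i] = 1.0
    return out
```

For a 110 nt promoter this yields a 110x4 input matrix, which a 1-D CNN then scans along the sequence axis with the four channels as input features.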